Introduction

In this project, you will explore the question of whether college education causally affects political participation. Specifically, you will use replication data from by former Berkeley PhD students John Henderson and Sara Chatfield. Their paper is itself a replication study of by Cindy Kam and Carl Palmer. In their original 2008 study, Kam and Palmer argue that college education has no effect on later political participation, and use the propensity score matching to show that pre-college political activity drives selection into college and later political participation. Henderson and Chatfield in their 2011 paper argue that the use of the propensity score matching in this context is inappropriate because of the bias that arises from small changes in the choice of variables used to model the propensity score. They use (at that point a new method), which uses an approach similar to optimal matching to optimize Mahalanobis distance weights. Even with genetic matching, they find that balance remains elusive however, thus leaving open the question of whether education causes political participation.

You will use these data and debates to investigate the benefits and pitfalls associated with matching methods. Replication code for these papers is available online, but as you’ll see, a lot has changed in the last decade or so of data science! Throughout the assignment, use tools we introduced in lab from the and the packages. Specifically, try to use dplyr, tidyr, purrr, stringr, and ggplot instead of base R functions. While there are other matching software libraries available, MatchIt tends to be the most up to date and allows for consistent syntax.

Data

The data is drawn from the which asked students and parents a variety of questions about their political participation. This survey was conducted in several waves. The first wave was in 1965 and established the baseline pre-treatment covariates. The treatment is whether the student attended college between 1965 and 1973 (the time when the next survey wave was administered). The outcome is an index that calculates the number of political activities the student engaged in after 1965. Specifically, the key variables in this study are:

Otherwise, we also have covariates measured for survey responses to various questions about political attitudes. We have covariates measured for the students in the baseline year, covariates for their parents in the baseline year, and covariates from follow-up surveys. . In general, post-treatment covariates will be clear from the name (i.e. student_1973Married indicates whether the student was married in the 1973 survey). Be mindful that the baseline covariates were all measured in 1965, the treatment occurred between 1965 and 1973, and the outcomes are from 1973 and beyond. We will distribute the Appendix from Henderson and Chatfield that describes the covariates they used, but please reach out with any questions if you have questions about what a particular variable means.

# Load tidyverse and MatchIt
# Feel free to load other libraries as you wish
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MatchIt)
suppressWarnings({
  log(-1) 
})
## [1] NaN
library(cobalt)
##  cobalt (Version 4.5.5, Build Date: 2024-04-02)
## 
## Attaching package: 'cobalt'
## The following object is masked from 'package:MatchIt':
## 
##     lalonde
# Load ypsps data
ypsps <- read_csv('data/ypsps.csv')
## Rows: 1254 Columns: 174
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (174): interviewid, college, student_vote, student_meeting, student_othe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ypsps[] <- lapply(ypsps, function(x) {
  if (inherits(x, "haven_labelled")) {
    return(as.numeric(x))
  } else {
    return(x)
  }
})

head(ypsps)

Randomization

Matching is usually used in observational studies to to approximate random assignment to treatment. But could it be useful even in randomized studies? To explore the question do the following:

# Generate a vector that randomly assigns each unit to treatment/control
set.seed(42)
test_ypsps <- ypsps %>% mutate(random_treat = sample(0:1,nrow(ypsps), replace = TRUE))

# Choose a baseline covariate (use dplyr for this)
test_ypsps <- test_ypsps %>% select(random_treat, parent_Vote)
  
# Visualize the distribution by treatment/control (ggplot)
balance_plot <- ggplot(test_ypsps, 
                       aes(x = factor(random_treat), fill = factor(parent_Vote))) +
  geom_bar(position = "fill") +
  theme_bw() +
  theme(legend.position = "bottom") +
  scale_fill_manual(values = c("red","steelblue")) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    x = "Treatment assignment (0=Control, 1=Treatment)",  
    y = "Proportion",                                    
    fill = "Parent Vote (0=No, 1=Yes)",                  
    title = "Distribution of Covariate with Randomized Treatment Assignment"
  )

print(balance_plot)

treatment_prop <- test_ypsps %>%
  filter(random_treat == 1) %>%
  summarize(prop_vote = mean(parent_Vote)) %>%
  pull()

control_prop <- test_ypsps %>%
  filter(random_treat == 0) %>%
  summarize(prop_vote = mean(parent_Vote)) %>%
  pull()

imbalance <- treatment_prop - control_prop
chisq <- chisq.test(table(test_ypsps$random_treat, test_ypsps$parent_Vote))
cat("Imbalance in covariate between treatment and control:", imbalance, "\n")
## Imbalance in covariate between treatment and control: -0.008160617
cat("Chi-square statistic:", chisq$statistic, "\n")
## Chi-square statistic: 0.0995548
cat("Chi-square p-value:", chisq$p.value, "\n")
## Chi-square p-value: 0.7523645
# Simulate this 10,000 times (monte carlo simulation - see R Refresher for a hint)
set.seed(42)

simulate_randomization <- function(data, iterations = 10000) {
  results <- map_dbl(1:iterations, function(i) {
    random_assign <- sample(0:1, nrow(data), replace = TRUE)
    
    treat_prop <- mean(data$parent_Vote[random_assign == 1])
    
    ctrl_prop <- mean(data$parent_Vote[random_assign == 0])
    
    return(treat_prop - ctrl_prop)
  })
  
  return(results)
}

imbalance_distribution <- simulate_randomization(ypsps)

# Visualize the distribution of imbalances
imbalance_plot <- ggplot(data.frame(imbalance = imbalance_distribution), aes(x = imbalance)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Distribution of Treatment-Control Imbalances in Covariate: parent_Vote",
       x = "Imbalance (Treatment Proportion - Control Proportion)",
       y = "Frequency") +
  theme_minimal()

print(imbalance_plot)

significant_imbalance <- mean(abs(imbalance_distribution) > 0.05)
cat("Proportion of randomizations with imbalance > 0.05:", significant_imbalance, "\n")
## Proportion of randomizations with imbalance > 0.05: 0.0151
cat("Mean of imbalances:", mean(imbalance_distribution), "\n")
## Mean of imbalances: -0.0001778991
cat("Standard deviation of imbalances:", sd(imbalance_distribution), "\n")
## Standard deviation of imbalances: 0.02071033
cat("95% range of imbalances:", quantile(imbalance_distribution, c(0.025, 0.975)), "\n")
## 95% range of imbalances: -0.04090556 0.04082022

Questions

Your Answer:

Propensity Score Matching

One Model

Select covariates that you think best represent the “true” model predicting whether a student chooses to attend college, and estimate a propensity score model to calculate the Average Treatment Effect on the Treated (ATT). Plot the balance of the top 10 (or fewer if you select fewer covariates). Report the balance of the p-scores across both the treatment and control groups, and using a threshold of standardized mean difference of p-score \(\leq .1\), report the number of covariates that meet that balance threshold.

# Select covariates that represent the "true" model for selection, fit model
# covariates selection
post_cov_list <- c(names(ypsps[123:174])) #placebo variables are exclueded
invalid_post_covs <- c("student_1973CollegeDegree", 
                       "student_1973CurrentCollege", 
                       "student_1973CollegeYears")
post_cov_list <- setdiff(post_cov_list, invalid_post_covs)

baseline_cov_list <- c(names(ypsps[12:120]))
student_ext_efficacy <- c ("student_GovtOpinion", "student_GovtCrook", "student_GovtWaste",
                                         "student_TrGovt", "student_GovtSmart", "student_Govt4All")

parent_ext_efficacy <- c ("parent_GovtOpinion", "parent_GovtCrook", "parent_GovtWaste",
                                         "parent_TrGovt", "parent_GovtSmart", "parent_Govt4All")


student_personality <- c("student_Cynic", "student_LifeWish", "student_GLuck", "student_FPlans",
                         "student_EgoA", "student_WinArg", "student_StrOpinion", "student_MChange",
                         "student_EgoB","student_TrOthers","student_OthHelp","student_OthFair",
                         "student_Trust")

parent_personality <- c("parent_Cynic", "parent_LifeWish", "parent_GLuck", "parent_FPlans",
                         "parent_EgoA", "parent_WinArg", "parent_StrOpinion", "parent_MChange",
                         "parent_EgoB","parent_TrOthers","parent_OthHelp","parent_OthFair",
                         "parent_Trust")

excluded <- c (student_ext_efficacy, parent_ext_efficacy,

               student_personality, parent_personality)

cov_selected <- baseline_cov_list[!baseline_cov_list %in% excluded]
# Fit model
true_model <- ypsps %>% select(college, 
                               student_ppnscal,
                               all_of(cov_selected),
                               all_of(post_cov_list))
formula_str <- paste("college ~", paste(cov_selected, collapse = " + "))
match_formula <- as.formula(formula_str)

match_result <- matchit(match_formula, 
                 data = true_model, 
                 method = "nearest", 
                 distance = "logit")
## Warning: Fewer control units than treated units; not all treated units will get
## a match.
summary(match_result)
## 
## Call:
## matchit(formula = match_formula, data = true_model, method = "nearest", 
##     distance = "logit")
## 
## Summary of Balance for All Data:
##                     Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                   0.7971        0.3613          2.0592     0.5761
## student_PubAff             0.9465        0.9490         -0.0113          .
## student_Newspaper          1.9427        2.2328         -0.2209     0.8638
## student_Radio              2.6401        2.7761         -0.0769     0.9925
## student_TV                 2.2167        2.1308          0.0672     0.9957
## student_Magazine           1.6139        2.0133         -0.4578     0.8563
## student_FamTalk            1.8431        2.1131         -0.2932     0.7770
## student_FrTalk             1.9938        2.4169         -0.4468     0.7057
## student_AdultTalk          2.9614        2.9978         -0.0350     0.9310
## student_PID                3.6314        3.2262          0.2075     1.0602
## student_SPID               1.7360        1.7716         -0.0369     0.8865
## student_Senate             0.5965        0.3415          0.5199          .
## student_Tito               0.3599        0.1197          0.5004          .
## student_Court              0.4819        0.2639          0.4365          .
## student_Govern             0.9377        0.8226          0.4764          .
## student_CCamp              0.9278        0.7228          0.7917          .
## student_FDR                0.7472        0.5100          0.5458          .
## student_Knowledge          0.6752        0.4634          0.9300     0.9403
## student_NextSch            0.9577        0.6452          1.5515          .
## student_GPA                2.2740        2.6231         -0.5332     0.9101
## student_SchOfficer         2.0212        2.2661         -0.2956     0.9193
## student_SchPublish         1.7111        1.6098          0.0883     1.1830
## student_Hobby              1.3873        1.3747          0.0143     1.0345
## student_SchClub            1.7771        1.4390          0.3182     1.4143
## student_OccClub            1.7161        1.9024         -0.1678     0.8650
## student_NeighClub          1.7696        1.6009          0.1483     1.2445
## student_RelClub            2.5567        2.3969          0.1403     0.9893
## student_YouthOrg           1.7111        1.4745          0.2155     1.4039
## student_MiscClub           0.3313        0.1840          0.3128          .
## student_ClubLev            1.8281        1.4789          0.2775     1.4412
## student_Phone              0.9614        0.8803          0.4211          .
## student_Gen                0.5479        0.4324          0.2322          .
## student_Race               1.0822        1.0931         -0.0359     1.0438
## parent_Newspaper           3.3674        2.7938          0.4612     0.5821
## parent_Radio               2.4770        2.3038          0.0963     0.9456
## parent_TV                  3.2130        3.2639         -0.0417     0.9538
## parent_Magazine            0.6600        0.4678          0.4057          .
## parent_PID                 3.6015        3.2550          0.1579     1.1463
## parent_SPID                2.0100        1.9202          0.0930     0.8759
## parent_Vote                0.8867        0.7517          0.4259          .
## parent_Persuade            0.4583        0.2639          0.3902          .
## parent_Rally               0.2964        0.1375          0.3480          .
## parent_OthAct              0.1806        0.0998          0.2100          .
## parent_PolClub             0.0859        0.0399          0.1642          .
## parent_Button              0.3188        0.2328          0.1845          .
## parent_Money               0.2877        0.1020          0.4102          .
## parent_Participate1        0.2713        0.1460          0.4343     1.5136
## parent_Participate2        0.2339        0.1224          0.3839     1.6441
## parent_ActFrq              2.3686        1.3326          0.3173     1.4370
## parent_Employ              0.6961        0.6120          0.1830          .
## parent_EducHH              3.3238        2.1685          0.7880     1.7174
## parent_EducW               3.1071        2.2993          0.6663     1.4055
## parent_ChurchOrg           1.9228        1.5920          0.2870     1.2800
## parent_FratOrg             1.3674        1.2306          0.1756     1.2410
## parent_ProOrg              1.4570        1.1131          0.3986     3.2463
## parent_CivicOrg            1.1968        1.0643          0.2114     2.9880
## parent_CLOrg               1.0847        1.0399          0.1335     1.2890
## parent_NeighClub           1.3263        1.1996          0.1676     1.3259
## parent_SportClub           1.3674        1.2373          0.1486     1.3706
## parent_InfClub             1.4732        1.2550          0.2309     1.6225
## parent_FarmGr              1.6227        1.5011          0.2157     0.7734
## parent_WomenClub           1.6949        1.5588          0.1890     1.2089
## parent_MiscClub            0.1569        0.1264          0.0839          .
## parent_ClubLev             1.3288        1.2417          0.1054     1.2550
## parent_FInc                7.7920        6.5366          0.6299     0.9205
## parent_HHInc               7.2864        5.9047          0.6445     0.9501
## parent_OwnHome             0.8443        0.7428          0.2801          .
## parent_Senate              0.3811        0.1752          0.4240          .
## parent_Tito                0.5455        0.2838          0.5255          .
## parent_Court               0.2864        0.1596          0.2804          .
## parent_Govern              0.9651        0.8936          0.3901          .
## parent_CCamp               0.8991        0.7561          0.4749          .
## parent_FDR                 0.9552        0.9202          0.1691          .
## parent_Knowledge           0.6721        0.5314          0.6426     1.0145
## parent_Gen                 0.4471        0.3880          0.1188          .
## parent_Race                1.0797        1.1153         -0.1197     0.6858
##                     eCDF Mean eCDF Max
## distance               0.3851   0.6151
## student_PubAff         0.0026   0.0026
## student_Newspaper      0.0580   0.1240
## student_Radio          0.0272   0.0449
## student_TV             0.0147   0.0452
## student_Magazine       0.1331   0.2108
## student_FamTalk        0.0675   0.1018
## student_FrTalk         0.0846   0.1702
## student_AdultTalk      0.0157   0.0436
## student_PID            0.0579   0.0998
## student_SPID           0.0254   0.0353
## student_Senate         0.2550   0.2550
## student_Tito           0.2402   0.2402
## student_Court          0.2181   0.2181
## student_Govern         0.1151   0.1151
## student_CCamp          0.2049   0.2049
## student_FDR            0.2372   0.2372
## student_Knowledge      0.1815   0.3678
## student_NextSch        0.3124   0.3124
## student_GPA            0.0698   0.2328
## student_SchOfficer     0.0816   0.1873
## student_SchPublish     0.0253   0.0509
## student_Hobby          0.0059   0.0094
## student_SchClub        0.0845   0.1765
## student_OccClub        0.0466   0.0815
## student_NeighClub      0.0422   0.0711
## student_RelClub        0.0399   0.0637
## student_YouthOrg       0.0591   0.1005
## student_MiscClub       0.1472   0.1472
## student_ClubLev        0.0873   0.1357
## student_Phone          0.0811   0.0811
## student_Gen            0.1156   0.1156
## student_Race           0.0080   0.0174
## parent_Newspaper       0.1147   0.1669
## parent_Radio           0.0346   0.0527
## parent_TV              0.0161   0.0479
## parent_Magazine        0.1922   0.1922
## parent_PID             0.0495   0.1035
## parent_SPID            0.0224   0.0420
## parent_Vote            0.1350   0.1350
## parent_Persuade        0.1944   0.1944
## parent_Rally           0.1589   0.1589
## parent_OthAct          0.0808   0.0808
## parent_PolClub         0.0460   0.0460
## parent_Button          0.0860   0.0860
## parent_Money           0.1857   0.1857
## parent_Participate1    0.1074   0.2334
## parent_Participate2    0.0929   0.1909
## parent_ActFrq          0.0864   0.2264
## parent_Employ          0.0842   0.0842
## parent_EducHH          0.1925   0.3590
## parent_EducW           0.1346   0.2858
## parent_ChurchOrg       0.0827   0.1675
## parent_FratOrg         0.0352   0.1032
## parent_ProOrg          0.0860   0.1935
## parent_CivicOrg        0.0331   0.0694
## parent_CLOrg           0.0132   0.0488
## parent_NeighClub       0.0317   0.0880
## parent_SportClub       0.0325   0.0683
## parent_InfClub         0.0546   0.1079
## parent_FarmGr          0.0430   0.1468
## parent_WomenClub       0.0340   0.0856
## parent_MiscClub        0.0305   0.0305
## parent_ClubLev         0.0218   0.0473
## parent_FInc            0.1255   0.2776
## parent_HHInc           0.1382   0.2613
## parent_OwnHome         0.1015   0.1015
## parent_Senate          0.2059   0.2059
## parent_Tito            0.2616   0.2616
## parent_Court           0.1268   0.1268
## parent_Govern          0.0716   0.0716
## parent_CCamp           0.1430   0.1430
## parent_FDR             0.0350   0.0350
## parent_Knowledge       0.1206   0.2650
## parent_Gen             0.0590   0.0590
## parent_Race            0.0119   0.0298
## 
## Summary of Balance for Matched Data:
##                     Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                   0.9380        0.3613          2.7248     0.0283
## student_PubAff             0.9379        0.9490         -0.0492          .
## student_Newspaper          1.8426        2.2328         -0.2971     0.7944
## student_Radio              2.6120        2.7761         -0.0928     0.9892
## student_TV                 2.2616        2.1308          0.1024     1.0215
## student_Magazine           1.4279        2.0133         -0.6710     0.6562
## student_FamTalk            1.7627        2.1131         -0.3804     0.7565
## student_FrTalk             1.8426        2.4169         -0.6065     0.5875
## student_AdultTalk          2.9490        2.9978         -0.0469     0.9022
## student_PID                3.9246        3.2262          0.3576     1.1004
## student_SPID               1.7339        1.7716         -0.0390     0.9041
## student_Senate             0.7206        0.3415          0.7728          .
## student_Tito               0.4856        0.1197          0.7622          .
## student_Court              0.6186        0.2639          0.7100          .
## student_Govern             0.9756        0.8226          0.6331          .
## student_CCamp              0.9778        0.7228          0.9850          .
## student_FDR                0.8603        0.5100          0.8061          .
## student_Knowledge          0.7731        0.4634          1.3601     0.6427
## student_NextSch            0.9956        0.6452          1.7398          .
## student_GPA                2.1375        2.6231         -0.7417     0.8941
## student_SchOfficer         1.9091        2.2661         -0.4308     0.8607
## student_SchPublish         1.7761        1.6098          0.1449     1.2455
## student_Hobby              1.3792        1.3747          0.0050     0.9896
## student_SchClub            1.9645        1.4390          0.4947     1.5524
## student_OccClub            1.6186        1.9024         -0.2555     0.8079
## student_NeighClub          1.8514        1.6009          0.2202     1.3308
## student_RelClub            2.5654        2.3969          0.1480     0.9743
## student_YouthOrg           1.7982        1.4745          0.2948     1.5076
## student_MiscClub           0.3880        0.1840          0.4334          .
## student_ClubLev            1.9490        1.4789          0.3735     1.5445
## student_Phone              0.9845        0.8803          0.5409          .
## student_Gen                0.6142        0.4324          0.3653          .
## student_Race               1.0687        1.0931         -0.0800     0.9197
## parent_Newspaper           3.5344        2.7938          0.5954     0.4066
## parent_Radio               2.4656        2.3038          0.0900     0.9395
## parent_TV                  3.1463        3.2639         -0.0964     1.0011
## parent_Magazine            0.7561        0.4678          0.6085          .
## parent_PID                 3.9047        3.2550          0.2960     1.2417
## parent_SPID                2.0865        1.9202          0.1722     0.8132
## parent_Vote                0.9268        0.7517          0.5526          .
## parent_Persuade            0.5676        0.2639          0.6097          .
## parent_Rally               0.3836        0.1375          0.5390          .
## parent_OthAct              0.2217        0.0998          0.3170          .
## parent_PolClub             0.1020        0.0399          0.2215          .
## parent_Button              0.3481        0.2328          0.2474          .
## parent_Money               0.4080        0.1020          0.6759          .
## parent_Participate1        0.3385        0.1460          0.6673     1.6515
## parent_Participate2        0.2927        0.1224          0.5864     1.8440
## parent_ActFrq              2.9157        1.3326          0.4849     1.6467
## parent_Employ              0.7184        0.6120          0.2314          .
## parent_EducHH              3.9290        2.1685          1.2008     1.3561
## parent_EducW               3.5078        2.2993          0.9968     1.2510
## parent_ChurchOrg           2.0377        1.5920          0.3867     1.3029
## parent_FratOrg             1.4124        1.2306          0.2334     1.3603
## parent_ProOrg              1.6984        1.1131          0.6783     4.2912
## parent_CivicOrg            1.2794        1.0643          0.3432     4.0043
## parent_CLOrg               1.0953        1.0399          0.1653     1.6522
## parent_NeighClub           1.3636        1.1996          0.2170     1.4036
## parent_SportClub           1.4102        1.2373          0.1976     1.4670
## parent_InfClub             1.5632        1.2550          0.3261     1.8207
## parent_FarmGr              1.6829        1.5011          0.3227     0.6910
## parent_WomenClub           1.7539        1.5588          0.2708     1.2613
## parent_MiscClub            0.1796        0.1264          0.1463          .
## parent_ClubLev             1.3814        1.2417          0.1691     1.4486
## parent_FInc                8.4124        6.5366          0.9411     0.6216
## parent_HHInc               7.9712        5.9047          0.9639     0.6783
## parent_OwnHome             0.8692        0.7428          0.3486          .
## parent_Senate              0.4856        0.1752          0.6392          .
## parent_Tito                0.6741        0.2838          0.7837          .
## parent_Court               0.3880        0.1596          0.5052          .
## parent_Govern              0.9778        0.8936          0.4593          .
## parent_CCamp               0.9712        0.7561          0.7142          .
## parent_FDR                 0.9756        0.9202          0.2679          .
## parent_Knowledge           0.7454        0.5314          0.9776     0.8502
## parent_Gen                 0.4745        0.3880          0.1739          .
## parent_Race                1.0643        1.1153         -0.1715     0.5713
##                     eCDF Mean eCDF Max Std. Pair Dist.
## distance               0.5612   0.9424          2.7248
## student_PubAff         0.0111   0.0111          0.4432
## student_Newspaper      0.0780   0.1641          0.9657
## student_Radio          0.0328   0.0532          1.0004
## student_TV             0.0218   0.0621          0.9947
## student_Magazine       0.1951   0.3038          1.1031
## student_FamTalk        0.0876   0.1530          1.0738
## student_FrTalk         0.1149   0.2262          1.1124
## student_AdultTalk      0.0211   0.0665          1.0786
## student_PID            0.0998   0.1663          1.2023
## student_SPID           0.0249   0.0399          1.1135
## student_Senate         0.3792   0.3792          1.1435
## student_Tito           0.3659   0.3659          0.9101
## student_Court          0.3548   0.3548          1.0650
## student_Govern         0.1530   0.1530          0.7800
## student_CCamp          0.2550   0.2550          1.1049
## student_FDR            0.3503   0.3503          1.1326
## student_Knowledge      0.2654   0.5477          1.5126
## student_NextSch        0.3503   0.3503          1.7618
## student_GPA            0.0971   0.3215          1.2091
## student_SchOfficer     0.1190   0.2616          1.0945
## student_SchPublish     0.0416   0.0665          0.8672
## student_Hobby          0.0055   0.0111          0.7349
## student_SchClub        0.1314   0.2683          1.0081
## student_OccClub        0.0710   0.1197          0.9623
## student_NeighClub      0.0626   0.1042          0.9490
## student_RelClub        0.0421   0.0710          1.1331
## student_YouthOrg       0.0809   0.1419          0.8804
## student_MiscClub       0.2040   0.2040          0.9705
## student_ClubLev        0.1175   0.1840          0.9338
## student_Phone          0.1042   0.1042          0.6330
## student_Gen            0.1818   0.1818          0.9979
## student_Race           0.0126   0.0310          0.5308
## parent_Newspaper       0.1481   0.2062          1.0661
## parent_Radio           0.0324   0.0554          0.9998
## parent_TV              0.0271   0.0732          0.9910
## parent_Magazine        0.2882   0.2882          1.1140
## parent_PID             0.0928   0.1752          1.1788
## parent_SPID            0.0416   0.0621          1.0860
## parent_Vote            0.1752   0.1752          0.9023
## parent_Persuade        0.3038   0.3038          1.0903
## parent_Rally           0.2461   0.2461          0.8594
## parent_OthAct          0.1220   0.1220          0.7782
## parent_PolClub         0.0621   0.0621          0.4905
## parent_Button          0.1153   0.1153          0.9040
## parent_Money           0.3060   0.3060          0.9111
## parent_Participate1    0.1650   0.3282          1.0951
## parent_Participate2    0.1419   0.2816          1.0048
## parent_ActFrq          0.1319   0.3215          0.9617
## parent_Employ          0.1064   0.1064          0.9642
## parent_EducHH          0.2934   0.5100          1.2976
## parent_EducW           0.2014   0.4191          1.2345
## parent_ChurchOrg       0.1114   0.2350          0.9639
## parent_FratOrg         0.0455   0.1286          0.7173
## parent_ProOrg          0.1463   0.3259          0.8428
## parent_CivicOrg        0.0538   0.1153          0.4989
## parent_CLOrg           0.0139   0.0510          0.3900
## parent_NeighClub       0.0410   0.1175          0.6745
## parent_SportClub       0.0432   0.0931          0.6484
## parent_InfClub         0.0771   0.1552          0.7485
## parent_FarmGr          0.0610   0.2129          1.0152
## parent_WomenClub       0.0488   0.1286          0.9110
## parent_MiscClub        0.0532   0.0532          0.7072
## parent_ClubLev         0.0349   0.0665          0.6362
## parent_FInc            0.1876   0.4058          1.2392
## parent_HHInc           0.2067   0.4013          1.2990
## parent_OwnHome         0.1264   0.1264          0.8868
## parent_Senate          0.3104   0.3104          0.9679
## parent_Tito            0.3902   0.3902          1.1044
## parent_Court           0.2284   0.2284          0.8681
## parent_Govern          0.0843   0.0843          0.6769
## parent_CCamp           0.2151   0.2151          0.7878
## parent_FDR             0.0554   0.0554          0.5036
## parent_Knowledge       0.1834   0.3902          1.2781
## parent_Gen             0.0865   0.0865          0.9321
## parent_Race            0.0170   0.0443          0.5742
## 
## Sample Sizes:
##           Control Treated
## All           451     803
## Matched       451     451
## Unmatched       0     352
## Discarded       0       0
# Plot the balance for the top 10 covariates

ps_model <- match_result$model
coef_vec <- coef(ps_model)
coef_vec <- coef_vec[names(coef_vec) != "(Intercept)"]


top10_names <- names(sort(abs(coef_vec), decreasing = TRUE))[1:10]
top10_df <- data.frame(
  Variable = top10_names,
  Coefficient = coef_vec[top10_names]
)


top10_df$Variable <- factor(top10_df$Variable, levels = top10_df$Variable[order(abs(top10_df$Coefficient))])


ggplot(top10_df, aes(x = Variable, y = Coefficient, fill = Coefficient > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_manual(values = c("red", "steelblue")) +
  labs(
    title = "Top 10 Covariates Influencing Propensity Score",
    y = "Logit Coefficient",
    x = NULL
  ) +
  theme_minimal() +
  theme(axis.text = element_text(size = 12))

bal <- bal.tab(match_result, un = TRUE)
balance_df <- bal$Balance
balance_df$varname <- rownames(balance_df)

vars_to_plot <- intersect(top10_names, balance_df$varname)
top10_df <- balance_df[balance_df$varname %in% vars_to_plot, ]

plot_df <- top10_df %>%
  select(varname, Diff.Un, Diff.Adj) %>%
  pivot_longer(cols = c(Diff.Un, Diff.Adj),
               names_to = "match_status",
               values_to = "smd") %>%
  mutate(match_status = recode(match_status,
                               "Diff.Un" = "Before Matching",
                               "Diff.Adj" = "After Matching"))
ggplot(plot_df, aes(x = smd, y = reorder(varname, abs(smd)), fill = match_status)) +
  geom_col(position = position_dodge(width = 0.7), width = 0.6) +
  labs(
    title = "Standardized Mean Differences of Top 10 Covariates",
    x = "Standardized Mean Difference",
    y = "Covariates",
    fill = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

bal_data_top10 <- balance_df %>%
  filter(varname %in% top10_names) %>%
  select(varname, Diff.Un, Diff.Adj)

bal_data_long <- bal_data_top10 %>%
  pivot_longer(cols = c(Diff.Un, Diff.Adj),
               names_to = "sample",
               values_to = "std_diff") %>%
  mutate(sample = factor(sample, 
                        levels = c("Diff.Un", "Diff.Adj"),
                        labels = c("Before Matching", "After Matching")),
         varname = factor(varname, levels = rev(top10_names)))

ggplot(bal_data_long, aes(x = std_diff, y = varname, color = sample)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_vline(xintercept = 0, linetype = "solid") +
  geom_vline(xintercept = c(-0.1, 0.1), linetype = "dashed") +
  scale_color_manual(values = c("Before Matching" = "#00BFC4", "After Matching" = "#F8766D")) +
  labs(title = "Covariate Balance (Top 10)",
       x = "Standardized Mean Differences",
       y = "",
       color = "Sample") +
  theme_minimal() +
  theme(legend.position = "right")

# Report the overall balance and the proportion of covariates that meet the balance threshold
overall_balance <- data.frame(
  Metric = c("Mean Absolute SMD", "Maximum SMD", "Number of Variables with SMD <= 0.1"),
  Before = c(mean(abs(bal$Balance$Diff.Un), na.rm = TRUE),
             max(abs(bal$Balance$Diff.Un), na.rm = TRUE),
             sum(abs(bal$Balance$Diff.Un) <= 0.1, na.rm = TRUE)),
  After = c(mean(abs(bal$Balance$Diff.Adj), na.rm = TRUE),
            max(abs(bal$Balance$Diff.Adj), na.rm = TRUE),
            sum(abs(bal$Balance$Diff.Adj) <= 0.1, na.rm = TRUE))
)


overall_long <- overall_balance %>%
  pivot_longer(cols = c("Before", "After"),
               names_to = "Matching",
               values_to = "Value") %>%
  mutate(Matching = paste(Matching, "Matching"))


ggplot(overall_long, aes(x = Matching, y = Value, fill = Matching)) +
  geom_col() +
  facet_wrap(~ Metric, scales = "free_y") +
  labs(title = "Overall Balance Before and After Matching",
       x = NULL,
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "none")

pscores <- match_result$distance
treat <- ypsps$college
weights <- match_result$weights

matched_df <- data.frame(
  pscore = pscores,
  treat = as.factor(treat),
  weights = weights
) %>% filter(weights > 0)  


ggplot(matched_df, aes(x = pscore, fill = treat)) +
  geom_density(alpha = 0.5) +
  labs(title = "Propensity Score Distribution After Matching",
       x = "Propensity Score",
       y = "Density",
       fill = "Treatment Group") +
  theme_minimal()

pscore_row <- bal$Balance["distance",]
print(pscore_row) 
##              Type  Diff.Un Diff.Adj
## distance Distance 2.059219 2.724827
balance_df <- bal$Balance
balance_vars <- balance_df[!rownames(balance_df) %in% c("(Intercept)", "distance"), ]

num_balanced <- sum(abs(balance_vars$Diff.Adj) <= 0.1, na.rm = TRUE)
cat("Number of covariates with adjusted SMD ≤ 0.1:", num_balanced, "\n")
## Number of covariates with adjusted SMD ≤ 0.1: 13
# Calculate ATT
matched_data <- match.data(match_result)

formula_str <- paste("student_ppnscal ~ college + ", paste(post_cov_list, collapse = " + "))

att_model <- lm(as.formula(formula_str),
                data = matched_data, 
                weights = matched_data$weights)


summary(att_model)
## 
## Call:
## lm(formula = as.formula(formula_str), data = matched_data, weights = matched_data$weights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2997 -0.4984 -0.1063  0.6811  2.3199 
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -4.0210552  3.7886481  -1.061   0.2951  
## college                       0.6413103  0.6142713   1.044   0.3029  
## student_1973Married          -0.6753605  0.6584745  -1.026   0.3114  
## student_1973Military                 NA         NA      NA       NA  
## student_1973Drafted          -0.0423950  0.5061875  -0.084   0.9337  
## student_1973Unemployed       -1.2528318  1.2882781  -0.972   0.3368  
## student_1973NoEmployers      -0.0727079  0.2307692  -0.315   0.7544  
## student_1973OwnHome          -0.5631424  0.4720372  -1.193   0.2401  
## student_1973NoResidences     -0.0282115  0.1293188  -0.218   0.8284  
## student_1973VoteNixon         0.7611735  0.5354969   1.421   0.1631  
## student_1973VoteMcgovern      1.1703161  0.6429628   1.820   0.0764 .
## student_1973HelpMinority      0.2013538  0.1644424   1.224   0.2281  
## student_1973Busing            0.1849767  0.1656493   1.117   0.2710  
## student_1973GovChange         0.0607760  0.1452737   0.418   0.6780  
## student_1973VietnamRight     -0.1711530  0.2123329  -0.806   0.4251  
## student_1973VietnamApprove   -0.0871764  0.8941264  -0.097   0.9228  
## student_1973Trust             0.0180392  0.5303082   0.034   0.9730  
## student_1973Luck              0.2301609  0.6682564   0.344   0.7324  
## student_1973SureAboutLife     0.4800236  0.5053488   0.950   0.3480  
## student_1973CurrentSituation -0.1803493  0.2003861  -0.900   0.3736  
## student_1973FutureSituation   0.2179359  0.2724169   0.800   0.4286  
## student_1973ThermMilitary     0.0046484  0.0123741   0.376   0.7092  
## student_1973ThermRadical     -0.0002792  0.0147822  -0.019   0.9850  
## student_1973ThermDems         0.0010365  0.0174936   0.059   0.9531  
## student_1973ThermRep          0.0139304  0.0198720   0.701   0.4875  
## student_1973ThermBlack        0.0190089  0.0150632   1.262   0.2145  
## student_1973ThermWhite       -0.0087086  0.0205166  -0.424   0.6736  
## student_1973ThermNixon       -0.0168620  0.0122282  -1.379   0.1758  
## student_1973ThermMcgovern    -0.0126486  0.0137646  -0.919   0.3638  
## student_1973Newspaper         0.7787715  0.6726325   1.158   0.2540  
## student_1973PubAffairs        0.0752827  0.4444871   0.169   0.8664  
## student_1973GovtEfficacy      0.0786220  0.8943957   0.088   0.9304  
## student_1973GovtNoSay        -0.1297694  0.1787903  -0.726   0.4723  
## student_1973PartyID           0.2035204  0.4841291   0.420   0.6765  
## student_1973IncSelf           0.0705983  0.0939086   0.752   0.4567  
## student_1973HHInc            -0.0065840  0.0923574  -0.071   0.9435  
## student_1973ChurchAttend      0.1739939  0.2106269   0.826   0.4138  
## student_1973Knowledge         0.2064580  0.2000747   1.032   0.3085  
## student_1973Ideology         -0.0450217  0.1699292  -0.265   0.7924  
## student_1982vote76           -0.1133216  0.7344083  -0.154   0.8782  
## student_1982vote80            0.1652541  0.7196804   0.230   0.8196  
## student_1982meeting           1.1148247  0.5782924   1.928   0.0612 .
## student_1982other             0.1246178  0.7645355   0.163   0.8714  
## student_1982button           -1.2702421  0.7524163  -1.688   0.0994 .
## student_1982money             0.6232329  0.6625337   0.941   0.3527  
## student_1982communicate       0.4490441  0.5209073   0.862   0.3939  
## student_1982demonstrate      -0.7494754  1.0430937  -0.719   0.4767  
## student_1982community         0.3430722  0.5096304   0.673   0.5048  
## student_1982IncSelf           0.0008095  0.0130224   0.062   0.9508  
## student_1982HHInc            -0.0028103  0.0117578  -0.239   0.8123  
## student_1982College           0.4703658  0.5493637   0.856   0.3971  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.468 on 39 degrees of freedom
##   (813 observations deleted due to missingness)
## Multiple R-squared:  0.7055, Adjusted R-squared:  0.3356 
## F-statistic: 1.907 on 49 and 39 DF,  p-value: 0.01964

Simulations

Henderson/Chatfield argue that an improperly specified propensity score model can actually the bias of the estimate. To demonstrate this, they simulate 800,000 different propensity score models by choosing different permutations of covariates. To investigate their claim, do the following:

run_ps_simulation <- function(data, treatment, outcome, 
                              baseline_covs, post_covs) {

  valid_baseline_covs <- baseline_covs[!(baseline_covs %in% c(treatment, outcome))]
  valid_baseline_covs <- Filter(function(var) {
    if (!var %in% names(data)) return(FALSE)
    if (length(unique(data[[var]])) <= 1) return(FALSE)
    TRUE
  }, valid_baseline_covs)
  
  if (length(valid_baseline_covs) == 0) {
    warning("No available baseline covariates")
    return(list(att = NA, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = NA))
  }
  
  n_covs <- sample(1:length(valid_baseline_covs), 1)
  ps_covs <- sample(valid_baseline_covs, n_covs)
  
  ps_formula <- as.formula(paste(treatment, "~", paste(ps_covs, collapse = " + ")))
  
  ps_match <- try(matchit(ps_formula, 
                          data = data,
                          method = "nearest",
                          distance = "logit",
                          replace = TRUE,
                          ratio = 1),
                  silent = TRUE)
  if (inherits(ps_match, "try-error")) {
    return(list(att = NA, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = ps_covs))
  }
  
  matched_data <- try(match.data(ps_match), silent = TRUE)
  if (inherits(matched_data, "try-error") || nrow(matched_data) == 0) {
    return(list(att = NA, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = ps_covs))
  }
  
  valid_post_covs <- post_covs[!(post_covs %in% c(treatment, outcome))]
  post_formula <- as.formula(
    paste(outcome, "~", treatment, 
          if (length(valid_post_covs) > 0) paste("+", paste(valid_post_covs, collapse = " + ")) else "")
  )
  
  m <- try(lm(post_formula, data = matched_data, weights = matched_data$weights), silent = TRUE)
  if (inherits(m, "try-error")) {
    return(list(att = NA, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = ps_covs))
  }
  
  att <- try(coef(m)[treatment], silent = TRUE)
  if (inherits(att, "try-error") || is.na(att)) {
    att <- NA
  }
  
  bal <- try(bal.tab(ps_match, un = TRUE, m.threshold = 0.1), silent = TRUE)
  if (inherits(bal, "try-error")) {
    return(list(att = att, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = ps_covs))
  }
  
  smd_table <- bal$Balance
  
  if ("M.Threshold.Adj" %in% names(smd_table)) {
    prop_balanced <- mean(smd_table$M.Threshold.Adj, na.rm = TRUE)
  } else {
    prop_balanced <- mean(abs(smd_table$Diff.Adj) <= 0.1, na.rm = TRUE)
  }
  
  if (!any(is.na(smd_table$Diff.Un)) && !any(is.na(smd_table$Diff.Adj))) {
    pct_improvements <- (smd_table$Diff.Un - smd_table$Diff.Adj) / smd_table$Diff.Un * 100
    mean_pct_improvement <- mean(pct_improvements, na.rm = TRUE)
  } else {
    mean_pct_improvement <- NA
  }
  
  
  return(list(att = att, 
              prop_balanced = prop_balanced, 
              mean_pct_improvement = mean_pct_improvement,
              n_covs = length(ps_covs),
              covs_used = ps_covs))
}
set.seed(42)


n_sims <- 10000
chunk_size <- 2500
n_chunks <- ceiling(n_sims / chunk_size)


# Pre-select 10 random indices for which detailed model information will be stored
selected_model_indices <- sample(1:n_sims, 10)
detailed_models <- vector("list", 10)


results_list <- vector("list", n_chunks)

for (chunk_i in seq_len(n_chunks)) {
  start_idx <- (chunk_i - 1) * chunk_size + 1
  end_idx   <- min(chunk_i * chunk_size, n_sims)
  indices   <- start_idx:end_idx

  block_results <- vector("list", length(indices))
  
  for (j in seq_along(indices)) {
    sim_index <- indices[j]
    
    if (j %% 500 == 0) {
      cat(sim_index, "/10000 \n")
    }
    
   
    result <- run_ps_simulation(
      data = ypsps,
      treatment = "college",
      outcome = "student_ppnscal",
      baseline_covs = baseline_cov_list,
      post_covs = post_cov_list
    )
    
    
    block_results[[j]] <- result
    
    
    if (sim_index %in% selected_model_indices) {
      
      detailed_model <- run_ps_simulation(
        data = ypsps,
        treatment = "college",
        outcome = "student_ppnscal",
        baseline_covs = baseline_cov_list,
        post_covs = post_cov_list
      )
      
      
      ps_formula <- as.formula(paste("college ~", paste(result$covs_used, collapse = " + ")))
      ps_match <- try(matchit(ps_formula, 
                             data = ypsps,
                             method = "nearest",
                             distance = "logit",
                             replace = TRUE,
                             ratio = 1),
                     silent = TRUE)
      
      
      detailed_model$ps_model <- ps_match
      detailed_model$sim_index <- sim_index
      detailed_models[[which(selected_model_indices == sim_index)]] <- detailed_model
    }
  }

block_df <- do.call(rbind, lapply(seq_along(block_results), function(j) {
    res <- block_results[[j]]
    data.frame(
      sim_id = indices[j],
      att = res$att, 
      prop_balanced = res$prop_balanced,
      mean_pct_improvement = res$mean_pct_improvement,
      n_covs = res$n_covs,
      stringsAsFactors = FALSE
    )
  }))
  
  block_df <- block_df[!is.na(block_df$att), ]
  
  results_list[[chunk_i]] <- block_df
  
  rm(block_results, block_df)
  gc()
}
## 500 /10000 
## 1000 /10000 
## 1500 /10000 
## 2000 /10000 
## 2500 /10000 
## 3000 /10000 
## 3500 /10000 
## 4000 /10000 
## 4500 /10000 
## 5000 /10000 
## 5500 /10000 
## 6000 /10000 
## 6500 /10000 
## 7000 /10000 
## 7500 /10000 
## 8000 /10000 
## 8500 /10000 
## 9000 /10000 
## 9500 /10000 
## 10000 /10000
results <- do.call(rbind, results_list)
# Plot ATT v. proportion
p1 <- ggplot(results, aes(x = prop_balanced, y = att)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess", color = "red") +
  labs(x = "Proportion of Balanced Covariates (SMD ≤ 0.1)",
       y = "Average Treatment Effect on Treated (ATT)",
       title = "Relationship Between Balance and Treatment Effect") +
  theme_minimal()

# Plot showing relationship between balance proportion and mean percent improvement
p2 <- ggplot(results, aes(x = mean_pct_improvement, y = prop_balanced)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess", color = "red") +
  labs(x = "Mean % Improvement in SMD",
       y = "Proportion of Balanced Covariates",
       title = "Balance Improvement vs Proportion Balanced") +
  theme_minimal()

# Display the plots
print(p1)
## `geom_smooth()` using formula = 'y ~ x'

print(p2)
## `geom_smooth()` using formula = 'y ~ x'

# 10 random covariate balance plots (hint try gridExtra)
# Note: ggplot objects are finnicky so ask for help if you're struggling to automatically create them; consider using functions!
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(grid)
library(stringr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
create_balance_plot <- function(ps_model, sim_id) {

  sum_obj <- try(summary(ps_model), silent = TRUE)
  if (inherits(sum_obj, "try-error")) {
    return(NULL)
  }
  

  if ("sum.all" %in% names(sum_obj)) {
    bal_data <- sum_obj$sum.all
    

    var_names <- rownames(bal_data)
    unadj_col <- "Std. Mean Diff."  
    

    if ("sum.matched" %in% names(sum_obj)) {
      matched_data <- sum_obj$sum.matched
      adj_col <- "Std. Mean Diff."  
      

      plot_data <- data.frame(
        variable   = var_names,
        Before = bal_data[, unadj_col],
        After   = matched_data[, adj_col]
      )
      

      if ("distance" %in% plot_data$variable) {
        plot_data <- subset(plot_data, variable != "distance")
      }
      

      plot_data_long <- melt(
        plot_data, 
        id.vars       = "variable",
        variable.name = "Sample",
        value.name    = "SMD"
      )
      
 

      p <- ggplot(plot_data_long, aes(x = SMD, y = variable, color = Sample)) +
        geom_point(size = 3, alpha = 0.8) +
       
        geom_vline(xintercept = 0, linetype = "solid") +
        geom_vline(xintercept = c(-0.1, 0.1), linetype = "dashed") +
       
        labs(
          title = paste("Model (Sim", sim_id, ")"),
          x = "SMD",
          y = NULL
        ) +
        scale_color_manual(values = c("Before" = "#45B6AA", "After" = "#F08080")) +
        theme_bw(base_size = 12) +
        theme(
          legend.position   = "right",
          plot.title        = element_text(hjust = 0.5, face = "bold"),
          panel.grid.minor  = element_blank()
        ) +
        coord_flip()  
      
      return(p)
    }
  }
  
}
plot_balance_pages <- function(balance_plots, n_per_page = 4) {
  n_total <- length(balance_plots)
  if (n_total < 1) return(invisible(NULL))
  
  n_pages <- ceiling(n_total / n_per_page)
  
  for (page_i in seq_len(n_pages)) {
    start_idx <- (page_i - 1) * n_per_page + 1
    end_idx   <- min(page_i * n_per_page, n_total)
    plots_subset <- balance_plots[start_idx:end_idx]
    
    grid_title <- textGrob(
      paste("Covariate Balance - Page", page_i, "of", n_pages),
      gp = gpar(fontsize = 16, fontface = "bold")
    )
    
 
    layout <- grid.arrange(
      grid_title,
      do.call(arrangeGrob, c(plots_subset, list(ncol = 2))),
      nrow    = 2,
      heights = c(0.1, 0.9)
    )
    

    print(layout)
  }
}
balance_plots <- list()


for (i in seq_along(detailed_models)) {
  model_info <- detailed_models[[i]]
  

  if (is.null(model_info)) {
    next
  }
  
  ps_model <- model_info$ps_model
  if (is.null(ps_model) || !inherits(ps_model, "matchit")) {
    next
  }
  
  sim_id <- model_info$sim_index
  

  cat("\nModel", i, "of", length(detailed_models), "(Simulation index =", sim_id, ")\n")
  cat("ATT:", model_info$att, "\n")
  cat("Proportion balanced:", model_info$prop_balanced, "\n")
  cat("Mean % improvement:", model_info$mean_pct_improvement, "\n")
  cat("Number of covariates used:", model_info$n_covs, "\n\n")
  

  p <- create_balance_plot(ps_model, sim_id)
  

  if (!is.null(p)) {
    balance_plots[[length(balance_plots) + 1]] <- p
  }
}
## 
## Model 1 of 10 (Simulation index = 2369 )
## ATT: 1.206867 
## Proportion balanced: 0.6395349 
## Mean % improvement: 86.79993 
## Number of covariates used: 85 
## 
## 
## Model 2 of 10 (Simulation index = 5273 )
## ATT: 0.2664646 
## Proportion balanced: 0.7272727 
## Mean % improvement: 70.57799 
## Number of covariates used: 21 
## 
## 
## Model 3 of 10 (Simulation index = 9290 )
## ATT: 0.9239538 
## Proportion balanced: 0.6101695 
## Mean % improvement: 68.31241 
## Number of covariates used: 58 
## 
## 
## Model 4 of 10 (Simulation index = 1252 )
## ATT: 1.24271 
## Proportion balanced: 0.4631579 
## Mean % improvement: 53.4 
## Number of covariates used: 94 
## 
## 
## Model 5 of 10 (Simulation index = 8826 )
## ATT: 1.30599 
## Proportion balanced: 1 
## Mean % improvement: 116.8036 
## Number of covariates used: 5 
## 
## 
## Model 6 of 10 (Simulation index = 356 )
## ATT: 0.6980617 
## Proportion balanced: 0.8888889 
## Mean % improvement: 207.6002 
## Number of covariates used: 17 
## 
## 
## Model 7 of 10 (Simulation index = 7700 )
## ATT: 0.9549326 
## Proportion balanced: 0.5897436 
## Mean % improvement: 28.7686 
## Number of covariates used: 77 
## 
## 
## Model 8 of 10 (Simulation index = 3954 )
## ATT: 1.248126 
## Proportion balanced: 0.4761905 
## Mean % improvement: 80.00658 
## Number of covariates used: 83 
## 
## 
## Model 9 of 10 (Simulation index = 9091 )
## ATT: 0.9832987 
## Proportion balanced: 0.7878788 
## Mean % improvement: 81.93241 
## Number of covariates used: 32 
## 
## 
## Model 10 of 10 (Simulation index = 5403 )
## ATT: 0.8576852 
## Proportion balanced: 0.6351351 
## Mean % improvement: 95.19026 
## Number of covariates used: 73
plot_balance_pages(balance_plots, n_per_page = 4)

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                grob
## 1 1 (1-1,1-1) arrange text[GRID.text.496]
## 2 2 (2-2,1-1) arrange     gtable[arrange]

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                grob
## 1 1 (1-1,1-1) arrange text[GRID.text.699]
## 2 2 (2-2,1-1) arrange     gtable[arrange]

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                grob
## 1 1 (1-1,1-1) arrange text[GRID.text.900]
## 2 2 (2-2,1-1) arrange     gtable[arrange]

Questions

extract_smd <- function(ps_model) {
  sum_obj <- try(summary(ps_model), silent = TRUE)
  if (inherits(sum_obj, "try-error")) return(NULL)

  if (!all(c("sum.all", "sum.matched") %in% names(sum_obj))) return(NULL)

  bal_before <- sum_obj$sum.all
  bal_after  <- sum_obj$sum.matched

  df_before <- data.frame(
    variable = rownames(bal_before),
    SMD_before = bal_before[,"Std. Mean Diff."],
    stringsAsFactors = FALSE
  )
  df_after <- data.frame(
    variable = rownames(bal_after),
    SMD_after = bal_after[,"Std. Mean Diff."],
    stringsAsFactors = FALSE
  )

  df_merged <- merge(df_before, df_after, by = "variable", all = TRUE)
  df_merged <- df_merged[ !df_merged$variable %in% c("distance", "(Intercept)"), ]
  return(df_merged)
}
smd_all_models <- data.frame()

for (i in seq_along(detailed_models)) {
  model_info <- detailed_models[[i]]
  if (is.null(model_info)) next
  
  ps_model   <- model_info$ps_model
  sim_id     <- model_info$sim_index
  
  if (!inherits(ps_model, "matchit")) next
  
  smd_df <- extract_smd(ps_model)
  if (is.null(smd_df) || nrow(smd_df) == 0) next
  
  smd_df$model_index <- i
  smd_df$sim_id <- sim_id
  
  smd_all_models <- rbind(smd_all_models, smd_df)
}
common_vars <- smd_all_models %>%
  count(variable) %>%
  filter(n >= 7) %>%
  pull(variable)

smd_common_long <- smd_all_models %>%
  filter(variable %in% common_vars) %>%
  pivot_longer(cols = c(SMD_before, SMD_after),
               names_to = "phase",
               values_to = "SMD")

common_plots <- lapply(common_vars, function(varname) {
  df_var <- smd_common_long %>% filter(variable == varname)

  ggplot(df_var, aes(x = factor(model_index), y = SMD, color = phase)) +
    geom_point(position = position_dodge(width = 0.4), size = 2, alpha = 0.75) +
    geom_hline(yintercept = c(-0.1, 0.1), linetype = "dashed", color = "gray50") +
    scale_color_manual(
      values = c("SMD_before" = "#45B6AA", "SMD_after" = "#F08080" ),
      labels = c("SMD_before" = "Before", "SMD_after" = "After")
    ) +
    labs(
      title = varname,
      x = "Model Index",
      y = "SMD",
      color = NULL
    ) +
    theme_minimal(base_size = 12) +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "bottom"
    )
})

plot_balance_pages(common_plots, n_per_page = 4)

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                 grob
## 1 1 (1-1,1-1) arrange text[GRID.text.1001]
## 2 2 (2-2,1-1) arrange      gtable[arrange]

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                 grob
## 1 1 (1-1,1-1) arrange text[GRID.text.1178]
## 2 2 (2-2,1-1) arrange      gtable[arrange]

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                 grob
## 1 1 (1-1,1-1) arrange text[GRID.text.1355]
## 2 2 (2-2,1-1) arrange      gtable[arrange]

Matching Algorithm of Your Choice

Simulate Alternative Model

Henderson/Chatfield propose using genetic matching to learn the best weights for Mahalanobis distance matching. Choose a matching algorithm other than the propensity score (you may use genetic matching if you wish, but it is also fine to use the greedy or optimal algorithms we covered in lab instead). Repeat the same steps as specified in Section 4.2 and answer the following questions:

run_optimal_simulation <- function(data, treatment, outcome, 
                              baseline_covs, post_covs) {
  
  old_warn <- options("warn")
  options(warn = -1)
  on.exit(options(old_warn))

  valid_baseline_covs <- baseline_covs[!(baseline_covs %in% c(treatment, outcome))]
  valid_baseline_covs <- Filter(function(var) {
    if (!var %in% names(data)) return(FALSE)
    if (length(unique(data[[var]])) <= 1) return(FALSE)
    TRUE
  }, valid_baseline_covs)
  
  if (length(valid_baseline_covs) == 0) {
    warning("No available baseline covariates")
    return(list(att = NA, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = NA))
  }
  
  n_covs <- sample(1:length(valid_baseline_covs), 1)
  ps_covs <- sample(valid_baseline_covs, n_covs)
  
  ps_formula <- as.formula(paste(treatment, "~", paste(ps_covs, collapse = " + ")))
  
  ps_match <- try(matchit(ps_formula,
                          data = data,
                          method = "optimal",
                          distance = "logit",
                          ratio = 1), silent = TRUE)
  
  if (inherits(ps_match, "try-error")) {
    return(list(att = NA, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = ps_covs))
  }
  
  matched_data <- try(match.data(ps_match), silent = TRUE)
  if (inherits(matched_data, "try-error") || nrow(matched_data) == 0) {
    return(list(att = NA, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = ps_covs))
  }
  
  valid_post_covs <- post_covs[!(post_covs %in% c(treatment, outcome))]
  post_formula <- as.formula(
    paste(outcome, "~", treatment, 
          if (length(valid_post_covs) > 0) paste("+", paste(valid_post_covs, collapse = " + ")) else "")
  )
  
  m <- try(lm(post_formula, data = matched_data, weights = matched_data$weights), silent = TRUE)
  if (inherits(m, "try-error")) {
    return(list(att = NA, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = ps_covs))
  }
  
  att <- try(coef(m)[treatment], silent = TRUE)
  if (inherits(att, "try-error") || is.na(att)) {
    att <- NA
  }
  
  bal <- try(bal.tab(ps_match, un = TRUE, m.threshold = 0.1), silent = TRUE)
  if (inherits(bal, "try-error")) {
    return(list(att = att, prop_balanced = NA, mean_pct_improvement = NA,
                covs_used = ps_covs))
  }
  
  smd_table <- bal$Balance
  
  if ("M.Threshold.Adj" %in% names(smd_table)) {
    prop_balanced <- mean(smd_table$M.Threshold.Adj, na.rm = TRUE)
  } else {
    prop_balanced <- mean(abs(smd_table$Diff.Adj) <= 0.1, na.rm = TRUE)
  }
  
  if (!any(is.na(smd_table$Diff.Un)) && !any(is.na(smd_table$Diff.Adj))) {
    pct_improvements <- (smd_table$Diff.Un - smd_table$Diff.Adj) / smd_table$Diff.Un * 100
    mean_pct_improvement <- mean(pct_improvements, na.rm = TRUE)
  } else {
    mean_pct_improvement <- NA
  }
  
  
  return(list(att = att, 
              prop_balanced = prop_balanced, 
              mean_pct_improvement = mean_pct_improvement,
              n_covs = length(ps_covs),
              covs_used = ps_covs))
}
set.seed(42)


n_sims <- 10000
chunk_size <- 1000
n_chunks <- ceiling(n_sims / chunk_size)


# Pre-select 10 random indices for which detailed model information will be stored
selected_model_indices <- sample(1:n_sims, 10)
detailed_models <- vector("list", 10)


results_list <- vector("list", n_chunks)

for (chunk_i in seq_len(n_chunks)) {
  start_idx <- (chunk_i - 1) * chunk_size + 1
  end_idx   <- min(chunk_i * chunk_size, n_sims)
  indices   <- start_idx:end_idx

  block_results <- vector("list", length(indices))
  
  for (j in seq_along(indices)) {
    sim_index <- indices[j]
    
    if (j %% 500== 0) {
      cat(sim_index, "/10000 \n")
    }
    
   
    result <- run_optimal_simulation(
      data = ypsps,
      treatment = "college",
      outcome = "student_ppnscal",
      baseline_covs = baseline_cov_list,
      post_covs = post_cov_list
    )
    
    
    block_results[[j]] <- result
    
    
    if (sim_index %in% selected_model_indices) {
      
      detailed_model <- run_optimal_simulation(
        data = ypsps,
        treatment = "college",
        outcome = "student_ppnscal",
        baseline_covs = baseline_cov_list,
        post_covs = post_cov_list
      )
      
      
      ps_formula <- as.formula(paste("college ~", paste(result$covs_used, collapse = " + ")))
      ps_match <- try(matchit(ps_formula, 
                             data = ypsps,
                             method = "optimal",
                             distance = "logit",
                             ratio = 1),
                     silent = TRUE)
      
      
      detailed_model$ps_model <- ps_match
      detailed_model$sim_index <- sim_index
      detailed_models[[which(selected_model_indices == sim_index)]] <- detailed_model
    }
  }

block_df <- do.call(rbind, lapply(seq_along(block_results), function(j) {
    res <- block_results[[j]]
    if(is.null(res$n_covs) || length(res$n_covs) == 0) {
      res$n_covs <- NA
    }
    
    data.frame(
      sim_id = indices[j],
      att = ifelse(is.null(res$att) || length(res$att) == 0, NA, res$att), 
      prop_balanced = ifelse(is.null(res$prop_balanced) || length(res$prop_balanced) == 0, NA, res$prop_balanced),
      mean_pct_improvement = ifelse(is.null(res$mean_pct_improvement) || length(res$mean_pct_improvement) == 0, NA, res$mean_pct_improvement),
      n_covs = ifelse(is.null(res$n_covs) || length(res$n_covs) == 0, NA, res$n_covs),
      stringsAsFactors = FALSE
    )
  }))
  
  block_df <- block_df[!is.na(block_df$att), ]
  
  results_list[[chunk_i]] <- block_df
  
  rm(block_results, block_df)
  gc()
}
## 500 /10000 
## 1000 /10000
## 1500 /10000 
## 2000 /10000
## 2500 /10000 
## 3000 /10000 
## 3500 /10000
## 4000 /10000 
## 4500 /10000 
## 5000 /10000
## 5500 /10000 
## 6000 /10000 
## 6500 /10000 
## 7000 /10000 
## 7500 /10000
## 8000 /10000 
## 8500 /10000
## 9000 /10000
## 9500 /10000 
## 10000 /10000
optimal_results <- do.call(rbind, results_list)
# Visualization for distributions of percent improvement
# Plot ATT v. proportion
p1 <- ggplot(optimal_results, aes(x = prop_balanced, y = att)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess", color = "red") +
  labs(x = "Proportion of Balanced Covariates (SMD ≤ 0.1)",
       y = "Average Treatment Effect on Treated (ATT)",
       title = "Relationship Between Balance and Treatment Effect") +
  theme_minimal()


# Plot showing relationship between balance proportion and mean percent improvement
p2 <- ggplot(optimal_results, aes(x = mean_pct_improvement, y = prop_balanced)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess", color = "red") +
  labs(x = "Mean % Improvement in SMD",
       y = "Proportion of Balanced Covariates",
       title = "Balance Improvement vs Proportion Balanced") +
  theme_minimal()

# Display the plots
print(p1)
## `geom_smooth()` using formula = 'y ~ x'

print(p2)
## `geom_smooth()` using formula = 'y ~ x'

# 10 random covariate balance plots (hint try gridExtra)
# Note: ggplot objects are finnicky so ask for help if you're struggling to automatically create them; consider using functions!

balance_plots <- list()


for (i in seq_along(detailed_models)) {
  model_info <- detailed_models[[i]]
  

  if (is.null(model_info)) {
    next
  }
  
  ps_model <- model_info$ps_model
  if (is.null(ps_model) || !inherits(ps_model, "matchit")) {
    next
  }
  
  sim_id <- model_info$sim_index
  

  cat("\nModel", i, "of", length(detailed_models), "(Simulation index =", sim_id, ")\n")
  cat("ATT:", model_info$att, "\n")
  cat("Proportion balanced:", model_info$prop_balanced, "\n")
  cat("Mean % improvement:", model_info$mean_pct_improvement, "\n")
  cat("Number of covariates used:", model_info$n_covs, "\n\n")
  

  p <- create_balance_plot(ps_model, sim_id)
  

  if (!is.null(p)) {
    balance_plots[[length(balance_plots) + 1]] <- p
  }
}
## 
## Model 1 of 10 (Simulation index = 2369 )
## ATT: 0.5231764 
## Proportion balanced: 0.5 
## Mean % improvement: 42.6445 
## Number of covariates used: 85 
## 
## 
## Model 2 of 10 (Simulation index = 5273 )
## ATT: 0.4822615 
## Proportion balanced: 0.5454545 
## Mean % improvement: 53.20505 
## Number of covariates used: 21 
## 
## 
## Model 3 of 10 (Simulation index = 9290 )
## ATT: 0.4980466 
## Proportion balanced: 0.5254237 
## Mean % improvement: 37.19984 
## Number of covariates used: 58 
## 
## 
## Model 4 of 10 (Simulation index = 1252 )
## ATT: -0.1792967 
## Proportion balanced: 0.4421053 
## Mean % improvement: 34.3151 
## Number of covariates used: 94 
## 
## 
## Model 5 of 10 (Simulation index = 8826 )
## ATT: 1.009405 
## Proportion balanced: 0.6666667 
## Mean % improvement: 81.21125 
## Number of covariates used: 5 
## 
## 
## Model 6 of 10 (Simulation index = 356 )
## ATT: -0.1900758 
## Proportion balanced: 0.5555556 
## Mean % improvement: -16.23733 
## Number of covariates used: 17 
## 
## 
## Model 7 of 10 (Simulation index = 7700 )
## ATT: 0.02746174 
## Proportion balanced: 0.5128205 
## Mean % improvement: 49.1513 
## Number of covariates used: 77 
## 
## 
## Model 8 of 10 (Simulation index = 3954 )
## ATT: 0.3793831 
## Proportion balanced: 0.5 
## Mean % improvement: 59.76281 
## Number of covariates used: 83 
## 
## 
## Model 9 of 10 (Simulation index = 9091 )
## ATT: 0.583686 
## Proportion balanced: 0.6666667 
## Mean % improvement: 41.84172 
## Number of covariates used: 32 
## 
## 
## Model 10 of 10 (Simulation index = 5403 )
## ATT: 0.2071296 
## Proportion balanced: 0.5675676 
## Mean % improvement: 61.3828 
## Number of covariates used: 73
plot_balance_pages(balance_plots, n_per_page = 4)

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                 grob
## 1 1 (1-1,1-1) arrange text[GRID.text.1614]
## 2 2 (2-2,1-1) arrange      gtable[arrange]

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                 grob
## 1 1 (1-1,1-1) arrange text[GRID.text.1815]
## 2 2 (2-2,1-1) arrange      gtable[arrange]

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name                 grob
## 1 1 (1-1,1-1) arrange text[GRID.text.2016]
## 2 2 (2-2,1-1) arrange      gtable[arrange]

Questions

compare_df <- bind_rows(
  results %>% select(mean_pct_improvement) %>% mutate(method = "PS"),
  optimal_results %>% select(mean_pct_improvement) %>% mutate(method = "Optimal")
)

compare_df <- compare_df %>% filter(!is.na(mean_pct_improvement))

ggplot(compare_df, aes(x = mean_pct_improvement, fill = method, color = method)) +
  geom_density(alpha = 0.4, adjust = 1.2) +
  labs(
    title = "Density of Mean % Improvement in SMD",
    x = "Mean % Improvement (SMD)",
    y = "Density",
    fill = "Matching Method",
    color = "Matching Method"
  ) +
  theme_minimal(base_size = 13) +
  scale_fill_manual(values = c("PS" = "#1b9e77", "Optimal" = "#d95f02")) +
  scale_color_manual(values = c("PS" = "#1b9e77", "Optimal" = "#d95f02")) +
  theme(legend.position = "top")

Looking ahead to the discussion questions, you may choose to model the propensity score using an algorithm other than logistic regression and perform these simulations again, if you wish to explore the second discussion question further.

Discussion Questions